Combined Celery+Julia Pods and Cron-job rolling restarts #655
@GUI let me know if you have thoughts on the TODO for Jenkinsfile-restart-celery-julia.yaml, mentioned in the PR description.
This will log requests hitting the Julia HTTP server, making it a little more obvious what's happening in the logs.
- Add some missing variables needed even for this basic restart task.
- Wait for rollout restarts to complete so we know if they've been successful or not.
GUI left a comment
@Bill-Becker: I haven't analyzed the performance of things after this change, but I think the basic change to have a 1:1 relationship between the Celery workers and the Julia pods looks good. I'm still not sure this will totally solve the performance issues you've seen, but it will hopefully at least alleviate the potential imbalance of load on Julia containers given how the queuing currently works.
Regarding the restart Jenkins task, I've added configuration for that (https://github.nrel.gov/TADA/tada-jenkins-config/pull/22), so you should now find a "restart-celery-julia" job in Jenkins. I've updated the Jenkinsfile in this branch to what I believe will be a functional version of what you were after. I was able to run it successfully against this branch, but I believe once it lands on master, then the cron-job style should kick in.
More generally, there might be more Kubernetes-native ways to accomplish this type of restart for misbehaving pods that could be more resilient to various issues. For example, Kubernetes health checks and memory limits can be configured so that pods restart automatically once they exceed a memory threshold and/or are detected as unhealthy. With the latest Redis issue this past week, where things stopped working at a specific time, you'd maybe have to wait up to a day for this scheduled task to kick in and restore functionality; if you had Kubernetes health checks configured on the pods, Kubernetes could restart them as soon as it detects a failure.

That said, it obviously takes more work to implement this type of accurate health check, and all of these approaches are still sort of bandaids on whatever the underlying issues are. I think you've maybe explored memory limits before, but I know all of this has been particularly funky, so I'm definitely not familiar enough with the ins and outs of this application to really know what's going on. But if these scheduled restarts can help, then hopefully the job is at least set up now in Jenkins to execute them.
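To illustrate the health-check and memory-limit idea, a pod spec fragment might look roughly like the sketch below. This is purely hypothetical: the container name, port, health endpoint, and thresholds are assumptions, not values from this repo's manifests.

```yaml
# Hypothetical sketch of a liveness probe plus a memory limit on a container.
# Names, paths, and numbers are placeholders, not this repo's actual config.
containers:
  - name: julia-api            # hypothetical container name
    resources:
      limits:
        memory: "4Gi"          # container is OOM-killed and restarted above this
    livenessProbe:
      httpGet:
        path: /health          # assumes the Julia HTTP server exposes a health endpoint
        port: 8081             # assumed port
      initialDelaySeconds: 60  # give Julia time to start before probing
      periodSeconds: 30
      failureThreshold: 3      # restart after ~90s of consecutive failures
```

The hard part, as noted above, is making the probed endpoint actually reflect the failure modes you've been seeing (e.g. the stuck-Redis case), not just whether the process is up.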
This PR addresses the Kubernetes server issues as follows:
- Combine Celery+Julia containers together in one pod, while leaving separate Julia-only pods for non-Celery Julia API calls
- Tested on a staging API deploy of this branch: POST requests to the /job endpoint only go to these pods (not the Julia-only pods), and the Julia-only pods get all the non-Celery API requests. You can see the Julia container logs of a Celery+Julia pod only if you click on the pod and then view the Julia logs. Kubernetes seems to balance Celery jobs OK, but sometimes stacks multiple consecutive requests on the same pod even when the other pod has no Celery jobs running.
- Cron-job rolling restarts of Julia containers
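Besides the Jenkins cron-style job, a rolling restart like this could also be expressed directly as a Kubernetes CronJob. The sketch below is a hypothetical alternative, not code from this PR; the schedule, names, image, and service account are all assumptions, and the service account would need RBAC permission to patch deployments.

```yaml
# Hypothetical sketch: a CronJob that rolling-restarts a deployment on a schedule.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-celery-julia            # hypothetical name
spec:
  schedule: "0 9 * * *"                 # assumed daily schedule
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: deployment-restarter  # assumed; needs RBAC to patch deployments
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl    # any image with kubectl available
              command:
                - /bin/sh
                - -c
                # Restart, then wait on the rollout so the job fails visibly
                # if the new pods never become ready.
                - kubectl rollout restart deployment/celery-julia &&
                  kubectl rollout status deployment/celery-julia --timeout=10m
```

The `rollout status` step mirrors the "wait for rollout restarts to complete" change in this branch's Jenkinsfile.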
Also:
- Update production and staging resources
- Align number of gunicorn workers with max Django pod CPUs
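For aligning gunicorn workers with pod CPUs, one common heuristic (from the Gunicorn documentation) is `(2 × CPUs) + 1` workers. A minimal sketch, assuming the worker count is derived from the Django pod's CPU limit rather than the node's CPU count (which is what `multiprocessing.cpu_count()` would report inside Kubernetes):

```python
def gunicorn_workers(pod_cpu_limit: int) -> int:
    """Worker count via Gunicorn's (2 * CPUs) + 1 rule of thumb.

    pod_cpu_limit is assumed to be the Django pod's CPU limit from the
    Kubernetes resource spec, passed in explicitly because the host CPU
    count seen inside a container can exceed the pod's actual limit.
    """
    return 2 * pod_cpu_limit + 1


# A pod capped at 2 CPUs would run 5 gunicorn workers.
print(gunicorn_workers(2))
```

The exact count here is a judgment call per workload; the point of the change is just that the worker count should track the pod's CPU allocation instead of being a fixed number.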